Introduction

This is a brief demonstration of a standard latent dirichlet allocation (LDA) for topic modeling, geared toward those who want to dig a little deeper than merely getting a result from a package command. The basic idea is to first generate some documents based on the underlying model, and then we’ll use the topicmodels package to recover the topics via LDA. You can find a basic overview of topic models in a variety of places, and I’ll eventually have a chapter in my graphical and latent variable models document.

Suffice it to say one can approach this in (at least) one of two ways. In one sense, LDA is a dimension reduction technique, much like the family of techniques that includes PCA, factor analysis, non-negative matrix factorization etc. We’ll take a whole lot of terms, loosely defined, and boil them down to a few topics. In this sense LDA is akin to discrete PCA. Another way to think about this is more from the perspective of factor analysis, where we are keenly interested in interpretation of the result, and want to know both what terms are associated with which topics, and what documents are more likely to present which topics. The following is the plate diagram and description for standard LDA from Wikipedia, which follows the typical depiction in Blei’s and other references.

  • \(\alpha\) is the parameter of the Dirichlet prior on the per-document topic distributions
  • \(\beta\) is the parameter of the Dirichlet prior on the per-topic word distribution
  • \(\theta_m\) is the topic distribution for document \(m\)
  • \(\varphi_k\) is the word distribution for topic \(k\)
  • \(z_{mn}\) is the topic for the \(n\)-th word in document \(m\)
  • \(w_{mn}\) is the specific word

Both \(z\) and \(w\) are from a multinomial draw based on the \(\theta\) and \(\varphi\) distributions respectively. For a given document, one draws a topic, and given the topic, words are drawn from it.

Generating Documents

In the standard setting, to be able to conduct such an analysis from text one needs a document-term matrix, where rows represent documents, and columns terms. Terms are typically words but could be any n-gram of interest. In practice, this is where you’ll spend most of your time, as text is never ready, and must be scraped, converted, stemmed, cleaned etc. In what follows we’ll be generating the text based on the underlying topic model. We’ll essentially be fixing the priors to create \(\theta\) and \(\varphi\) noted above, then given those, draw topics and words given topics based on the multinomial distribution.

Topic Probabilities

We begin the simulation by creating topic probabilities. There will be \(k=3\) topics. Half of our documents will have probabilities of topics for them (\(\theta_1\)) which will be notably different from the other half (\(\theta_2\)). Specifically, the first half will show higher probability of topic ‘A’ and ‘B’, while the second half of documents show higher probability of topic ‘C’. What we’ll end up with here is an \(m\) x \(k\) matrix of probabilities \(\theta\) where each \(m\) document has a non-zero probability for each \(k\) topic. Note that in the plate diagram, these would come from a \(\mathcal{Dirichlet}(\mathbf{\alpha})\) draw rather than be fixed like this, but hopefully this will make things clear starting out. Later we will use the Dirichlet for the terms.

# Note that in what follows, I strive for conceptual clarity, not code efficiency.
library(tidyverse)

Ndocs = 500                                        # Number of documents
WordsPerDoc = rpois(Ndocs, 100)                    # Total words/terms in a document
thetaList = list(c(A=.60, B=.25, C=.15),           # Topic proportions for first and second half of data 
                 c(A=.10, B=.10, C=.80))           # These values represent a Dir(alpha) draw
theta_1 = t(replicate(Ndocs/2, thetaList[[1]]))
theta_2 = t(replicate(Ndocs/2, thetaList[[2]]))
theta = rbind(theta_1, theta_2)                    # Topic probabilities for all 500 docs

If you like, the following will use the Dirichlet explicitly, and amounts to the same thing but with randomness thrown in.

theta_1 = t(replicate(Ndocs/2, rdirichlet(1, c(6, 2.5, 1.5)), simplify=T))
theta_2 = t(replicate(Ndocs/2, rdirichlet(1, c(1, 1, 8)), simplify=T))

Topic Assignments and Labels

With topic probabilities in hand, we’ll draw topic assignments from a categorical distribution, which, for those not in computer science, is the multinomial with size=1 (see the commented line at the end).

firsthalf = 1:(Ndocs/2)
secondhalf = (Ndocs/2+1):Ndocs

Z = t(apply(theta, 1, rmultinom, n=1, size=1))    # draw topic assignment
colMeans(Z[firsthalf,])                           # roughly equal to theta_1
[1] 0.58 0.26 0.16
colMeans(Z[secondhalf,])                          # roughly equal to theta_2
[1] 0.124 0.088 0.788
z = apply(Z, 1, which.max)                        # topic assignment as arbitrary label 1:3
# z = apply(theta, 1, function(topprob) extraDistr::rcat(1, topprob)) # topic assignment via categorical dist

Topics

Next we need the topics themselves. Topics are probability distributions of terms, and in what follows we’ll use the Dirichlet distribution to provide the prior probabilities for the terms. With topic A, we’ll make the first ~40% of terms have a higher probability of occurring, the last ~40% go with topic C, and the middle more associated with topic B. To give a sense of the alpha settings, alpha=c(8,1,1) would result in topic probabilities of .8, .1, .1, as would alpha=c(80,10,10), though the latter would serve as a much stronger prior. We’ll use the gtools package for the rdirichlet function. I also provide a visualization, where the dark represents terms that are notably less likely to be associated with a particular topic.

Nterms = max(WordsPerDoc)
breaks = quantile(1:Nterms, c(.4,.6,1)) %>% round()
cuts = list(1:breaks[1], (breaks[1]+1):breaks[2], (breaks[2]+1):Nterms)

library(gtools)

phi_k = matrix(0, ncol=3, nrow=Nterms)

phi_k[,1] = rdirichlet(n=1, alpha=c(rep(10, length(cuts[[1]])),       # topics for 1st 40% of terms
                                    rep(1,  length(cuts[[2]])),
                                    rep(1,  length(cuts[[3]]))))
phi_k[,2] = rdirichlet(n=1, alpha=c(rep(1,  length(cuts[[1]])),       # topics for middle 20%
                                    rep(10, length(cuts[[2]])),
                                    rep(1,  length(cuts[[3]]))))
phi_k[,3] = rdirichlet(n=1, alpha=c(rep(1,  length(cuts[[1]])),       # topics for last 40%
                                    rep(1,  length(cuts[[2]])),
                                    rep(10, length(cuts[[3]]))))


Now, given the topic assignment, we draw words for each document according to its size via a multinomial draw.

wordlist_1 = sapply(1:Ndocs, 
                    function(i) t(rmultinom(1, size=WordsPerDoc[i], prob=phi_k[,z[i]]))  
                    , simplify = F)  

# smash to doc-term matrix
dtm_1 = do.call(rbind, wordlist_1)
colnames(dtm_1) = paste0('word', 1:Nterms)

# bag of words representation
wordlist_1 = lapply(wordlist_1, function(wds) rep(paste0('word', 1:length(wds)), wds))   


Here’s a glimpse of the document-term matrix.

And here’s what the first document looks like.

Term Freq
word118 5
word90 5
word97 5
word105 4
word123 4
word98 4
word101 3
word114 3


With a matrix approach, we don’t need an explicit topic label, just the topic indicator matrix and topic probabilities. I will let you see this for yourself that the results are essentially the same, but we’ll just stick to the previous results for this demo.

ZB = tcrossprod(Z, phi_k)                                                
dtm_2 = t(sapply(1:Ndocs, function(i) rmultinom(1, WordsPerDoc[i], ZB[i,])))
wordlist_2 = apply(dtm_2, 1, function(row) rep(paste0('word', which(row!=0)), row[row!=0]) )

colnames(dtm_2) = paste0('word', 1:Nterms)

Topic Models

We’re now ready to run the models. Depending on the number of documents, terms, and other settings, it is possible to get an unsatisfactory/random result. Usually a redo is enough to recover topic probabilities, but the commented settings are an attempt to get better result from the outset. However, they will notably slow things down.

library(topicmodels)
controlSettings = list(nstart=10, verbose=100)
# others var=list(iter.max=5000, tol=10e-8), em=list(iter.max=5000, tol=10e-8)
LDA_1 = LDA(dtm_1, k=3, control=controlSettings)

Let’s look at the results. To be clear, the topic labels are completely arbitrary, so what is topic “1” for one analysis might be topic “3” for another. In this case, they align such that A-1, B-2, and C-3. The main thing for us is the recovery of the distribution of the topics, which will be roughly 0.6, 0.25, 0.15 for the first half (arbitrary order), and 0.1, 0.1, 0.8 for the other. Visually we can see the expected result, the first half of the documents are mostly associated with the first topic, and a little more so for the second. The last half are mostly associated with the third topic and little else.

LDA_1Post = posterior(LDA_1)


We can compare our results with the true topic probabilities and see we’ve recovered things pretty well.

  A B C
true_topic_probs_first_half 0.60 0.25 0.15
true_topic_probs_second_half 0.10 0.10 0.80
LDA_1_first_half 0.61 0.26 0.13
LDA_1_second_half 0.10 0.09 0.81


As another approach to looking at how the documents relate to one another, I calculated their KL divergence based on the estimated topic probabilities, and show the (lower triangle) of the resulting distance matrix of documents. Darker indicates documents that are less alike, and we see the later documents are less like the earliest ones, as we would expect.

tidytext

The tidytext package can facilitate working with LDA results, as well as doing all the text pre-processing beforehand. As it is relatively new to the scene, I thought it might be useful to demonstrate some of how it works. The first step would assume your initial data is in ‘tidy’ format (or what is more typically called ‘long’ form). In this case, each row represents a word within a document. Additionally we have a column for how many times the word occurs and the document id. Note that I drop 0 counts. This is because tidytext will not create a sparse dtm object otherwise.

library(tidytext)
lda_dat_df = dtm_1 %>%
  as_data_frame() %>% 
  mutate(doc=1:n()) %>% 
  gather(key=term, value=count, -doc) %>% 
  filter(count>0) %>%
  arrange(doc)

lda_dat_df
doc term count
1 word31 1
1 word38 1
1 word56 1
1 word72 1
1 word79 1
1 word81 2
1 word83 2
1 word84 2
1 word87 2
1 word88 1


Again, tidytext will help you get to that state via the initial processing. Once there, you can then create the document term matrix with cast_dtm.

dtm = lda_dat_df %>%
  cast_dtm(doc, term, count)

dtm
<<DocumentTermMatrix (documents: 500, terms: 128)>>
Non-/sparse entries: 24990/39010
Sparsity           : 61%
Maximal term length: 7
Weighting          : term frequency (tf)

The above information tells us how much sparsity (or zero counts) we have in the matrix (61%), the actual number of zeros vs. non-zeros that produce the sparsity percentage, the maximal word/term length (7), and the weighting used. In this case we are dealing with the raw frequencies, thus the weighting is just term frequency.

Once you have the topic model results, you can then start working with term and topic probabilities via the functions in topicmodels or the ‘tidy’ version of the LDA object and tidytext. First we’ll get the top 10 terms and graphically display them. Recall that topic C-3 should be seen with later words, and A-1 with earlier, and this is born out fairly clearly. Topic B was our less frequent topic with a smaller vocabulary. See also the LDAvis for an interactive way to examine topic and term relationships.

terms(LDA_1, 10)
Topic 1 Topic 2 Topic 3
word22 word67 word95
word44 word76 word100
word15 word63 word120
word49 word74 word94
word37 word71 word86
word1 word77 word115
word11 word57 word126
word45 word72 word93
word34 word58 word96
word12 word56 word132
top_terms = tidy(LDA_1, matrix='beta') %>% 
  group_by(topic) %>%  
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)


We can get at the topic probabilities for each document as follows. For this demonstration, probabilities are essentially 1 or 0, something you may not find in practice.

doctop_probs = tidy(LDA_1, matrix = "gamma")

doctop_probs %>% 
  group_by(document, topic) %>% 
  summarise(meanProb=mean(gamma)) %>% 
  head
document topic meanProb
1 1 0.0003705
1 2 0.9993
1 3 0.0003705
2 1 0.9992
2 2 0.0004034
2 3 0.0004034


A rough breakdown of expected topic probabilities across all documents is \(.6*Ndocs/2 + .1*Ndocs/2 =\) 0.35 for topic A, \(.25*Ndocs/2 + .1*Ndocs/2 =\) 0.175 for B, and \(.8*Ndocs/2 + .15*Ndocs/2 =\) 0.475 for C. We can use the topics function to extract the most likely topic per document and get the frequency of occurrence. Note that we don’t have to say that each document is only associated with one topic in general.

prop.table(table(topics(LDA_1)))

    1     2     3 
0.352 0.178 0.470 
## # dplyr is notably more verbose here, but can be used also
## doctop_probs %>%
##   group_by(document) %>%
##   summarise (topic = which.max(gamma)) %>%
##   group_by(topic) %>% 
##   summarise(n = n()) %>% 
##   mutate(freq = n / sum(n))
topic n freq
1 176 0.352
2 89 0.178
3 235 0.47

Looks like we’re right where we’d expect.

Summary

This document conceptually demonstrates the mechanics of LDA’s data generating process. I suggest you go back and fiddle with the settings to see how things change. For example, one could use an actual Dirichlet draw for theta, and a more loose setting for the term-topic affiliation for a more nuanced result.

Note that just about everything you’ll come across with LDA will regard topic modeling, but it is by no means restricted to the analysis of text. In fact, one of its earliest uses was actually in genetics. Any time you have a matrix of counts like this, LDA is a potential candidate for analysis, and might be preferable to PCA or factor analysis.

References